30 research outputs found

    Improving average ranking precision in user searches for biomedical research datasets

    Availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge for research data management systems is to provide users with the best answers to their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorisation method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus of 800k datasets and 21 annotated user queries. Our system provides competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP among the participants, +22.3% higher than the median infAP of the participants' best submissions. Overall, it ranks in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method had a positive impact on the system's performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. Our similarity measure algorithm seems to be robust, in particular compared to the Divergence From Randomness framework, showing smaller performance variations under different training conditions. Finally, the result categorisation did not have a significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. In particular, the use of data-driven query expansion methods could be an alternative to the complexity of biomedical terminologies
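
    The embedding-based query expansion mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tiny 3-dimensional vectors, the function names `cosine` and `expand_query`, and the `k` and `min_sim` parameters are all assumptions made for illustration. A real system would presumably load vectors trained on a biomedical corpus.

```python
import math

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.1],
    "mouse": [0.0, 0.9, 0.1],
    "murine": [0.05, 0.85, 0.15],
    "protein": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2, min_sim=0.95):
    """Append up to k nearest neighbours of each query term.

    Neighbours below min_sim are ignored, and terms without an
    embedding are kept unchanged.
    """
    expanded = list(terms)
    for term in terms:
        vec = embeddings.get(term)
        if vec is None:
            continue
        neighbours = sorted(
            (
                (cosine(vec, other_vec), other)
                for other, other_vec in embeddings.items()
                if other != term
            ),
            reverse=True,
        )
        for sim, other in neighbours[:k]:
            if sim >= min_sim and other not in expanded:
                expanded.append(other)
    return expanded


print(expand_query(["cancer", "mouse"], TOY_EMBEDDINGS))
```

    The similarity threshold is one simple way to limit query drift. The abstract's weighting of query terms by relevance is not modelled here.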

    Multilingual RECIST classification of radiology reports using supervised learning.

    OBJECTIVES The objective of this study is to explore Artificial Intelligence and Natural Language Processing techniques to support the automatic assignment of the four Response Evaluation Criteria in Solid Tumors (RECIST) scales based on radiology reports. We also aim to evaluate how languages and institutional specificities of Swiss teaching hospitals are likely to affect the quality of the classification in French and German. METHODS In our approach, 7 machine learning methods were evaluated to establish a strong baseline. Then, robust models were built, fine-tuned according to the language (French and German), and compared with the expert annotation. RESULTS The best strategies yield average F1-scores of 90% and 86% for the two-class (Progressive/Non-progressive) and the four-class (Progressive Disease, Stable Disease, Partial Response, Complete Response) RECIST classification tasks, respectively. CONCLUSIONS These results are competitive with manual labeling as measured by the Matthews correlation coefficient and Cohen's Kappa (79% and 76%). On this basis, we confirm the capacity of specific models to generalize to new unseen data and we assess the impact of using Pre-trained Language Models (PLMs) on the accuracy of the classifiers
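
    For readers unfamiliar with the agreement measures named in this abstract, the following sketch computes F1, the Matthews correlation coefficient and Cohen's Kappa from a two-class confusion matrix, using the standard textbook definitions. The function name `binary_agreement_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_agreement_metrics(tp, fp, fn, tn):
    """Return F1, MCC and Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return {"f1": f1, "mcc": mcc, "kappa": kappa}


# Hypothetical counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
print(binary_agreement_metrics(40, 10, 10, 40))
```

    With these balanced hypothetical counts, F1 is 0.8 while MCC and Kappa are both 0.6, which shows how the chance-corrected measures can sit well below F1.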

    The SIB Swiss Institute of Bioinformatics' resources: focus on curated databases

    The SIB Swiss Institute of Bioinformatics (www.isb-sib.ch) provides world-class bioinformatics databases, software tools, services and training to the international life science community in academia and industry. These solutions allow life scientists to turn the exponentially growing amount of data into knowledge. Here, we provide an overview of SIB's resources and competence areas, with a strong focus on curated databases and SIB's most popular and widely used resources. In particular, SIB's Bioinformatics resource portal ExPASy features over 150 resources, including UniProtKB/Swiss-Prot, ENZYME, PROSITE, neXtProt, STRING, UniCarbKB, SugarBindDB, SwissRegulon, EPD, arrayMap, Bgee, SWISS-MODEL Repository, OMA, OrthoDB and other databases, which are briefly described in this article

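    Assisting the curation of scientific publications with automatic triage and annotation methods [original title: Assistance à la curation de publications scientifiques par des méthodes de triage et d'annotation automatiques]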

    The literature is a gigantic, unstructured knowledge base that stores the ever-growing contributions of the scientific community. Through curators, scientific publications are annotated and checked, and the entities identified are linked to other knowledge sources. Curators also aim to make all of this information (found or created) accessible and reusable for the community, hence the design of dedicated databases (such as neXtProt). This thesis studies different strategies in information retrieval and text mining (improved document triage via MEDLINE, entity recognition, information extraction, etc.) in order to automate and simplify the overall curation process. The final product of this research, neXtA5, is a system optimized for each step of the process and integrated into its users' routine so as to meet their expectations in terms of usability (effectiveness, efficiency, satisfaction)

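    Assisting the curation of scientific publications with automatic triage and annotation methods [original title: Assistance à la curation de publications scientifiques par des méthodes de triage et d'annotation automatiques]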

    The literature review is a fundamental step in scientific research. Exploring existing methods and results in a particular field serves several purposes. Among other things, it makes it possible to identify the information relevant to carrying out a project, or to put one's ideas and conclusions into perspective with the achievements of other experts. Yet this literature is a gigantic, unstructured knowledge base that stores the ever-growing contributions of the scientific community. In this context, the role of curators is to process the literature as it is produced and to ensure the reliability of the information offered. Through them, scientific publications are annotated and checked, and the entities identified are linked to other knowledge sources. Curators also aim to make all of this information (found or created) accessible and reusable for the community, hence the design of dedicated databases. neXtProt is one such resource, designed and maintained by the CALIPHO group of the Swiss Institute of Bioinformatics to contribute to the understanding of human proteins. To cope with the dramatic increase in the amount of information produced by research while maintaining the quality standard of the information offered in this database, the neXtProt curators decided to implement methods for automating the curation process, in collaboration with the SIB Text-Mining group. Ultimately, neXtA5 is a literature curation support platform resulting from this collaboration

    Welcoming LGBTIQ+ patrons in the libraries of French-speaking Switzerland: feedback from professionals and from the people directly concerned

    This research explores the practices used by libraries in French-speaking Switzerland to include LGBTIQ+ patrons. It rests on two observations. First, the persistence of the discrimination experienced by the LGBTIQ+ population in Switzerland. Second, the absence of reflection on this question within professional associations. This second point is probably explained by the belief that the absence of any explicit discriminatory policy exonerates the profession from all criticism. From this observation follows the first difficulty of this work: making an unexamined issue visible and going beyond commonly used research methods in order to account, in a novel way, for a social problem that is still too often rendered invisible. The study of the academic literature and of professional publications reflects ongoing thinking about the social function of libraries in general and the questions raised by the inclusion of certain publics in particular. The discussion of the concept of inclusion here implies a reversal of perspective and invites libraries to adapt to their publics by working alongside them rather than in their place. In concrete terms, four areas of attention were identified: collections, outreach, reception and governance. For each of these, the aim is to identify good practices, existing or potential, and to measure how well they match the expectations of the publics concerned. To this end, 6 interviews were conducted with librarians, while 6 people who identify as LGBTIQ+ and use libraries agreed to elaborate on their position in an interview. These interviews were supplemented by 3 interviews with specialists in inclusion issues in libraries and 2 interviews with specialists in LGBTIQ+ issues. Finally, a survey gathered the views of 93 people who identify as LGBTIQ+ and use libraries. This approach, which aims to confront institutional and professional practices with the views of the publics, is directly inspired by the feminist standpoint methodologies developed in the social sciences. The analysis of our data reveals that, although inclusion measures have sometimes been put in place in the libraries of French-speaking Switzerland, these practices remain marginal and stem from isolated initiatives by individual librarians. The inclusion of LGBTIQ+ publics almost never appears to be a policy supported by supervisory authorities or by library management. The librarians interviewed also report a lack of tools and training in this area. The public concerned expresses its frustration with institutions that are essentially heterocisnormative. While the people interviewed are not unanimous about the solutions to adopt, they often identify the same problems. To remedy the severe shortcomings in inclusion that this research has identified, recommendations of three kinds can be made. First, act on the positioning of libraries. Next, act as a spokesperson within the profession in order to make these issues visible. Finally, foster an inclusive welcome in one's own workplace

    BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual Information Extraction

    BiTeM/SIB Text Mining (http://bitem.hesge.ch/) is a university research group carrying out activities in semantic and text analytics applied to health and life sciences. This paper reports on the participation of our team in the CLEF eHealth 2016 evaluation lab. The processing applied to each evaluation corpus (QUAERO and CépiDC) was originally very similar. Our method is based on an Automatic Text Categorization (ATC) system. First, the system is set up with a specific input ontology (French UMLS), and ATC assigns a ranked list of related concepts to each input document. Then, a second module locates all of the positive matches in the text and normalizes the extracted entities. For the CépiDC corpus, the system was loaded with the Swiss ICD-10 GM thesaurus. However, a last-minute data transformation issue forced us to implement an ad hoc solution based on simple pattern matching to comply with the constraints of the CépiDC challenge. We obtained an average precision of 62% on the QUAERO entity extraction (over MEDLINE/EMEA texts, and exact/inexact), 48% on normalizing these entities, and 59% on the CépiDC subtask. Enhancing recall by expanding the coverage of the terminologies could be an interesting approach to improve this system at moderate labour cost

    Designing retrieval models to contrast precision-driven ad hoc search vs. recall-driven treatment extraction in precision medicine

    The TREC 2019 Precision Medicine Track repeats the general structure and evaluation of the 2018 track. Our team participated in both tasks of the track, covering scientific abstracts and clinical trials. 40 topics containing patient data (demographic data, disease, gene and genetic variant) were available for this competition. The aim was to retrieve scientific abstracts and clinical trials of interest for a topic, each topic modelling the description of a clinical case. In the first task, we aim to retrieve scientific abstracts that introduce relevant treatments for a given case. Our system first collects a large set of abstracts related to a particular case using various strategies, such as keyword search within abstracts, search with normalized entities within annotated abstracts, and the linear combination of various queries. We then apply different strategies to re-rank the resulting set of scientific abstracts. In particular, we tested two strategies to re-rank the abstracts so that a large variety of treatments is returned in the top articles. Almost two thirds of the top-10 returned documents are judged relevant, while nearly a quarter of the relevant treatments are returned in the top-10 abstracts. The second task aims to retrieve clinical trials for which patients are eligible. The criteria used to determine the eligibility of patients are those found in the topics. Information such as trial location or status of clinical trials, which is important from a patient's point of view, is questionably not used in these topics. Several strategies were tested: relaxing constraints (data required or not), expanding information requests with synonyms or regular expressions, and boosting the retrieval status value for some criteria or fields. After judging, for almost half of the topics, a minimum of 50% of the documents retrieved are relevant, up to 90% for 10 of the 38 topics provided. Our best runs achieve highly competitive results depending on the measures, being ranked on average #2 or #3 according to the official results for the literature task
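
    Two ideas from this abstract, the linear combination of several queries and re-ranking for treatment variety, can be sketched as follows. The abstract does not specify either strategy in detail, so this is only one plausible reading: the functions `combine_runs` and `rerank_for_treatment_variety`, the weights, and the example documents and treatments are all hypothetical.

```python
def combine_runs(runs, weights):
    """Linearly combine per-query score dictionaries into one ranking.

    runs: list of {doc_id: score} dicts, one per query strategy.
    weights: one weight per run.
    Returns doc ids sorted by descending combined score (ties by id).
    """
    combined = {}
    for run, weight in zip(runs, weights):
        for doc_id, score in run.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return sorted(combined, key=lambda d: (-combined[d], d))


def rerank_for_treatment_variety(ranked_docs):
    """Promote documents that mention a treatment not seen so far.

    ranked_docs: list of (doc_id, set_of_treatments) ordered by
    descending relevance. Documents adding at least one new treatment
    keep their relative order and come first; the rest follow.
    """
    seen = set()
    novel, redundant = [], []
    for doc_id, treatments in ranked_docs:
        if treatments - seen:
            novel.append(doc_id)
            seen |= treatments
        else:
            redundant.append(doc_id)
    return novel + redundant


# Hypothetical keyword run and entity run, equally weighted.
print(combine_runs([{"a": 1.0, "b": 0.5}, {"b": 1.0, "c": 0.4}], [0.5, 0.5]))

# Hypothetical ranked abstracts with the treatments each one mentions.
docs = [
    ("d1", {"imatinib"}),
    ("d2", {"imatinib"}),
    ("d3", {"sunitinib"}),
    ("d4", set()),
    ("d5", {"imatinib", "regorafenib"}),
]
print(rerank_for_treatment_variety(docs))
```

    The greedy pass trades some precision at the top of the list for coverage of distinct treatments, which matches the recall-driven goal described in the abstract.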